In GPU acceleration, we must abandon the "compute-first" mindset. Modern performance is dominated by memory management: the coordination of data allocation, synchronization, and optimization between the host (CPU) and the device (GPU).
1. The Memory-Compute Gap
While GPU arithmetic throughput (TFLOPS) has risen dramatically, memory bandwidth (GB/s) has grown far more slowly. This gap leaves the execution units frequently "starved", waiting for data to arrive from device memory. As a result, GPU programming is largely memory programming.
2. The Roofline Model
This model visualizes the relationship between arithmetic intensity (FLOPs/Byte) and achievable performance. Applications typically fall into one of two regimes:
- Memory-bound: limited by bandwidth (the sloped part of the roof).
- Compute-bound: limited by peak TFLOPS (the flat ceiling).
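The roofline bound can be sketched in a few lines of Python. The hardware numbers below are illustrative assumptions, not specs for any particular GPU:

```python
# Roofline sketch: attainable performance is the lesser of the compute
# ceiling and the bandwidth slope at a given arithmetic intensity.

PEAK_TFLOPS = 60.0     # assumed peak arithmetic throughput, TFLOP/s
BANDWIDTH_TBPS = 1.6   # assumed memory bandwidth, TB/s

def attainable_tflops(intensity_flops_per_byte: float) -> float:
    """Return the roofline bound for a given arithmetic intensity (FLOPs/Byte)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TBPS * intensity_flops_per_byte)

# A streaming kernel like vector add (~1 FLOP per 12 bytes) is memory-bound:
print(attainable_tflops(1 / 12))  # far below the compute ceiling

# A dense matrix multiply with heavy data reuse hits the flat ceiling:
print(attainable_tflops(100.0))   # capped at PEAK_TFLOPS
```

Plotting `attainable_tflops` against intensity on log-log axes reproduces the familiar slope-then-ceiling shape.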
3. The Cost of Data Movement
The primary bottleneck is rarely the math itself; it is the latency and energy cost of moving each byte across the PCIe bus or out of high-bandwidth memory (HBM). High-performance code therefore prioritizes data residence, minimizing transfers between host and device.
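A back-of-envelope comparison makes the "performance tax" concrete. The bandwidth figures below are illustrative assumptions (roughly PCIe Gen4 x16 vs. a modern HBM stack):

```python
# Compare the time to move a working set over PCIe vs. streaming it from HBM.
# All numbers are illustrative assumptions, not measured values.

DATA_BYTES = 4 * 1024**3   # 4 GiB working set
PCIE_BPS = 32e9            # assumed effective PCIe bandwidth, bytes/s
HBM_BPS = 1.6e12           # assumed on-device HBM bandwidth, bytes/s

pcie_time = DATA_BYTES / PCIE_BPS  # host <-> device transfer time
hbm_time = DATA_BYTES / HBM_BPS    # on-device read of the same data

print(f"PCIe transfer: {pcie_time * 1e3:.1f} ms")
print(f"HBM read:      {hbm_time * 1e3:.1f} ms")
print(f"PCIe is {pcie_time / hbm_time:.0f}x slower")
```

With these assumed numbers, a single round-trip over the interconnect costs tens of HBM-equivalent reads, which is why redundant host-device copies dominate runtime so easily.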
QUESTION 1
What is the primary cause of a GPU kernel being 'memory-bound'?
The clock speed of the GPU cores is too slow.
The rate of data delivery is slower than the rate of arithmetic execution.
There are too many threads running in parallel.
The CPU is faster than the GPU.
✅ Correct!
Correct! When data cannot be fed to execution units fast enough to keep them busy, the kernel is limited by memory bandwidth.
❌ Incorrect
Memory-bound refers specifically to the bandwidth bottleneck, not core clock speeds.
QUESTION 2
In the context of GPU programming, what does 'Memory Management' involve?
Only allocating variables on the CPU stack.
Controlling allocation, synchronization, and optimization of data transfer between host and device.
Optimizing the cache size of the L1 controller.
Manually cleaning the GPU registers after every kernel call.
✅ Correct!
Correct. It is the strategic orchestration of data across the entire hardware hierarchy.
❌ Incorrect
Memory management in HIP/ROCm encompasses the movement and lifecycle of data between Host and Device.
QUESTION 3
Which axis of the Roofline Model represents 'Arithmetic Intensity'?
Vertical Axis (Y)
Horizontal Axis (X)
The slope of the line.
The area under the curve.
✅ Correct!
Correct. The X-axis measures FLOPs per Byte, determining where an application sits relative to the bandwidth wall.
❌ Incorrect
The Y-axis represents performance (GFLOPS); the X-axis represents intensity.
QUESTION 4
Why is redundant host-device transfer considered a 'performance tax'?
It consumes GPU registers.
Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.
It increases the floating-point precision error.
It causes the GPU to overheat instantly.
✅ Correct!
Correct. Data movement is often the most expensive operation in terms of both time and power.
❌ Incorrect
Data movement doesn't affect math precision; it affects performance and power efficiency.
QUESTION 5
If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?
The math instructions are too complex.
Inefficient orchestration of data residence causing the GPU to wait for data.
The GPU has too much VRAM.
The kernel was written in C++ instead of Python.
✅ Correct!
Correct. Stalls usually indicate the compute units are idle while waiting for high-latency memory transactions.
❌ Incorrect
Complex math would make a kernel compute-bound, not necessarily cause 95% idle stalls.
Case Study: The Climate Simulation Bottleneck
Optimizing a Fluid Dynamics Kernel
A research team is running a massive climate simulation. Their HIP fluid-dynamics kernel should theoretically sustain high TFLOPS, but profiling shows the GPU spends 95% of its time stalled. The team currently transfers data from Host to Device at every time-step.
Q
Why does transferring data at every time-step likely cause the 95% stall?
Solution:
The PCIe bottleneck: The time taken to move data between Host RAM and Device VRAM via the interconnect is orders of magnitude slower than the kernel execution, forcing the GPU to wait (stall) for the next set of data.
Q
Based on the axiom 'GPU programming is memory programming,' what should the team's first optimization step be?
Solution:
Strategic orchestration of data residence: The team should keep data on the GPU across multiple time-steps and only transfer results back to the host when necessary, minimizing 'redundant' transfers.
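The restructuring can be sketched as follows. The `copy_to_device` / `copy_to_host` helpers are hypothetical stand-ins for the real HIP memcpy calls, used only to count transfers:

```python
# Sketch of the data-residence fix. Instead of 2 transfers per time-step,
# the state stays resident in VRAM and crosses PCIe exactly twice overall.

transfers = []  # log of (direction, buffer) pairs, for illustration

def copy_to_device(name):  # hypothetical stand-in for a H2D hipMemcpy
    transfers.append(("H2D", name))

def copy_to_host(name):    # hypothetical stand-in for a D2H hipMemcpy
    transfers.append(("D2H", name))

def run_kernel(step):      # stand-in for the fluid-dynamics kernel launch
    pass

N_STEPS = 1000

copy_to_device("state")        # one upload before the loop
for step in range(N_STEPS):
    run_kernel(step)           # state never leaves VRAM between steps
copy_to_host("state")          # one download of the final results

print(len(transfers))  # 2 transfers instead of 2 * N_STEPS
```

The naive version would interleave a `copy_to_device` and `copy_to_host` inside the loop, paying the interconnect tax 2,000 times for the same simulation.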